About the project

This course is about Open Data, Open Science and Data Science. We learn to use the R programming language!

Here is a link to my repository


Exercises for week 2

This week I have learned how to edit datasets, and how to represent the dataset graphically with R. I have also practiced evaluating the results from a linear regression model.

The dataset

First I read the dataset that will be used in the analysis into R and name it learning2014.

learning2014 <- read.table("/Users/Noora/Documents/IODS-project/data/learning2014.txt", header = TRUE)

The dataset contains information about students who have taken the exam of a specific course (Johdatus yhteiskuntatilastotieteeseen). In the dataset I have combined information about each participating student’s gender, age, attitude (global attitude toward statistics), deep learning score, strategic learning score, surface learning score and exam points. The scores for attitude and for deep, strategic and surface learning were computed from each student’s responses to a questionnaire and are measured on a 5-level Likert scale.

I don’t show the full dataset here, although it could be printed simply by typing learning2014 in R. Instead, I show the first 6 observations to give a reader unfamiliar with the dataset some idea of its contents. In the code below I first load the knitr package that I have installed. I use knitr to make the table easier to read (this is the knitr::kable part of the code) and the head function to show the first 6 observations.

library(knitr)
knitr::kable(head(learning2014))
| gender | age | attitude | deep     | stra  | surf     | points |
|--------|-----|----------|----------|-------|----------|--------|
| F      | 53  | 3.7      | 3.583333 | 3.375 | 2.583333 | 25     |
| M      | 55  | 3.1      | 2.916667 | 2.750 | 3.166667 | 12     |
| F      | 49  | 2.5      | 3.500000 | 3.625 | 2.250000 | 24     |
| M      | 53  | 3.5      | 3.500000 | 3.125 | 2.250000 | 10     |
| M      | 49  | 3.7      | 3.666667 | 3.625 | 2.833333 | 22     |
| F      | 38  | 3.8      | 4.750000 | 3.625 | 2.416667 | 21     |

I can explore the dimensions of my dataset to find out how many rows and columns it has.

dim(learning2014)
## [1] 166   7

This shows that the dataset has 166 rows (observations) and 7 columns (variables).

I can also explore the structure of the dataset.

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

This shows again that the dataset has 166 observations and 7 variables, but the output also shows the names of the variables (e.g. gender, age), the type of data they contain (e.g. integers for age) and the first few observations of each.

Overview and summaries of the data

Next, I look at a graphical overview of the data. I use the packages GGally and ggplot2 here.

library(GGally)
library(ggplot2)
GraphicalOverview <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.5), title = "Graphical Overview", lower = list(combo = wrap("facethist", bins = 20)))
GraphicalOverview

I can also get summaries of all the variables in my dataset.

Summary of variable gender:

summary(learning2014$gender)
##   F   M 
## 110  56

Summary of variable age:

summary(learning2014$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   21.00   22.00   25.51   27.00   55.00

Summary of variable attitude:

summary(learning2014$attitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   2.600   3.200   3.143   3.700   5.000

Summary of variable deep:

summary(learning2014$deep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   3.333   3.667   3.680   4.083   4.917

Summary of variable stra:

summary(learning2014$stra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.188   3.121   3.625   5.000

Summary of variable surf:

summary(learning2014$surf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   2.417   2.833   2.787   3.167   4.333

Summary of variable points:

summary(learning2014$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

The summaries give basic information about the variables. For all variables except gender, the output shows the minimum and maximum values of the variable in my dataset, as well as other key figures, i.e. the mean, the median and the 1st and 3rd quartiles. For example, for the variable points I see that all values are between 7 and 33, and 50% of the values are between 19 and 27.75.

For gender, the output shows the frequencies of the two categories, i.e. that 110 of the students are female and 56 male. The first row of the graphical overview also gives this same information. The same row also shows the minimum, maximum, median and 1st and 3rd quartile values of each variable for the two genders (e.g. I can see the median age of males in the dataset). These are, however, difficult to read accurately from the graphical overview.

The graphical overview also shows correlation values for all variable pairs, and the scatter plots can give some indication of the relationships between variables. For example, when looking at attitude, the highest point values correspond to high attitude values and vice versa. For age, however, both the highest and lowest point values seem to correspond to relatively low age values. Thus I can assume that there is a relationship between attitude and points but not between age and points.

Creating a regression model

Next I will create a regression model with three explanatory variables, age, gender and attitude, and exam points as the target variable. In other words, I am trying to investigate whether the variables age, gender and attitude can be used to predict the value of exam points. Earlier I saw that attitude might be related to points but age most likely isn’t, so let’s see what this regression model reveals!

I will conduct some tests on my model to see if the three variables can be used to predict the value of exam points or not. Let’s view the summary of the model:

Model <- lm(points ~ age + gender + attitude, data = learning2014)
summary(Model)
## 
## Call:
## lm(formula = points ~ age + gender + attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## age         -0.07586    0.05367  -1.414    0.159    
## genderM     -0.33054    0.91934  -0.360    0.720    
## attitude     3.60657    0.59322   6.080 8.34e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08

The result shows that this model explains about 20% of the variance of points. The regression line intersects the y-axis at 13.4, and attitude is positively related to points (an increase in attitude increases points) whereas the other variables are negatively related.
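Written out using the coefficient estimates from the summary above, the fitted model is:

$$\widehat{\text{points}} = 13.43 - 0.076 \cdot \text{age} - 0.331 \cdot \text{genderM} + 3.607 \cdot \text{attitude}$$

where genderM is an indicator that is 1 for male students and 0 for female students.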

The results indicate that the null hypothesis that a variable’s coefficient is 0 cannot be rejected for age and gender, as the p-values of both are greater than 0.05 (0.159 and 0.720 respectively). Based on this information I should remove these two variables. Attitude, however, seems to be statistically significant and can be left in the model. Of course there might be reasons why one of the other variables should be kept in the model rather than removed just because of its p-value, but I don’t have any expertise in this topic that could help me justify keeping variables with high p-values.

Second regression model

Let’s conduct the same test but this time with just the variable attitude.

Model2 <- lm(points ~ attitude, data = learning2014)
summary(Model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

According to this test, attitude is statistically significant, so I keep it in our model!

The relationships between the variables in the model

Based on the output of the summary, I see that the estimated effect of attitude on exam points is 3.5 and the variables are positively related. In other words, if attitude increases by 1, exam points increase by 3.5 on average.

I can also see the value of the intercept parameter (alpha), i.e. the point where the regression line intersects the y-axis. This value in my model is 11.6, which implies that a person with attitude 0 would get 11.6 points from the exam. I can also see that the standard error of attitude is 0.6 and the p-value very small (4.12 x 10^-09).
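As a worked example using the estimates from the summary, a student with attitude 4 would be predicted to score:

$$\widehat{\text{points}} = 11.64 + 3.53 \times 4 \approx 25.7$$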

The multiple R-squared value tells how well the model explains the target variable, exam points. I see that around 19% of the variation in points is explained by attitude. I can say that attitude seems to contribute to success in the exam (exam points), but there are clearly other variables that affect success in the exam as well.

Diagnostic plots

I can also draw diagnostic plots for my model. I want the diagnostic plots to show residuals vs. fitted values, a normal QQ-plot and residuals vs. leverage. These will be explained in the analysis of the diagnostic plots.

par(mfrow=c(2,2))
plot(Model2, which = c(1, 2, 5))

Because my model is a linear regression, it assumes that there is a linear relationship between attitude and exam points. Another important assumption is that there is a random variable (e) that adds noise to the observations. This e describes the errors of the model. The linear regression model has several assumptions related to the errors: it is assumed that the errors 1) are normally distributed, 2) aren’t correlated, and 3) have constant variance (i.e. the size of an error doesn’t depend on the explanatory variables, in this case attitude).
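In symbols, the simple linear model with its error term can be written as:

$$\text{points}_i = \alpha + \beta \cdot \text{attitude}_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

with the errors assumed to be independent (uncorrelated) of each other and of attitude.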

Analysing the residuals of the model allows me to investigate the validity of these assumptions.

The first plot, residuals vs. fitted, can be used to evaluate whether the errors have constant variance. If there is any pattern in the scatter plot, there is a problem with the assumptions of the model. However, there doesn’t appear to be any pattern in my scatter plot; the spread of points seems random. Thus the plot doesn’t suggest any problems with my model’s assumptions.

The second plot, normal QQ, can be used to evaluate the normality of the errors. The better the points fit the line in the plot, the better they satisfy the normality assumption. From the QQ-plot I can see that most of the observations fit the normality assumption, except at both ends of the line, where there is more deviation. Thus the normality assumption might be questionable.

The third plot, residuals vs. leverage, can be used to see how great an impact a single observation has on the model. There might be e.g. one observation that significantly changes the slope of the regression line. In my plot, there are a couple of observations at the far right side. However, the leverage of these observations is only around 0.04, which is extremely low. Thus no single observation seems to have a high impact on my model.

Based on the diagnostic plots, I can reasonably assume that my model’s assumptions are valid, which means that the results of the tests can also be trusted.


Exercises for week 3

The student alcohol consumption dataset

First I name my dataset student_alc and read it into R from my local folder.

student_alc <- read.table("/Users/Noora/Documents/IODS-project/data/create_alc.txt", header = TRUE)

The names of the variables in the dataset are:

colnames(student_alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The dataset contains information about student alcohol use in Portugal. The original dataset, which is available from here, has been modified so that two separate datasets have been joined. The joined dataset has information about each student (e.g. age, gender, mother’s job, whether they have internet access or not), about their alcohol consumption (e.g. workday and weekend alcohol consumption), and about their school performance in mathematics and Portuguese. The variable names shown above list all 35 variables in the dataset. The dataset has 382 observations.

Choosing four variables to study

I will next study the relationships between high/low alcohol consumption and the following four variables: studytime, age, failures, and romantic. Here are my personal hypotheses about how each variable is related to alcohol consumption:

  • studytime: students with high alcohol consumption spend less time studying (negatively related to high use)
  • age: students of different ages have similar alcohol consumption rates (no relationship)
  • failures: students with high alcohol consumption have more failures (positively related to high use)
  • romantic: students with a romantic relationship have high alcohol consumption, single students have lower consumption

Exploring the variables

I will next look at the distributions of the variables by printing out summaries of them:

summary(student_alc$studytime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.037   2.000   4.000
summary(student_alc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   16.00   17.00   16.59   17.00   22.00
summary(student_alc$failures)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2016  0.0000  3.0000
summary(student_alc$romantic)
##  no yes 
## 261 121
summary(student_alc$high_use)
##    Mode   FALSE    TRUE    NA's 
## logical     268     114       0

Study time describes the student’s weekly study time on a scale of 1 to 4 (1 = <2h, 2 = 2-5h, 3 = 5-10h, 4 = >10h). The mean score for study time in the dataset is 2.037, so the average student studies between 2 and 5 hours per week. There are some who study less than 2h and some who study more than 10h (the minimum is 1 and the maximum 4).

The ages in the dataset range from 15 to 22, with the average student being about 17. Most students have no past failures, the median number of failures being 0 and the mean 0.2016. The maximum number of failures is 3. Most students are not in a romantic relationship (261 out of 382, i.e. almost 70%).

Additionally, I explored the number of students belonging to the high_use group. The majority of students (n = 268) don’t consume a lot of alcohol, but a large minority (n = 114) do.

Let’s next investigate some plots about the variables.

Study time

library(ggplot2)
ba1 <- ggplot(student_alc, aes(x = studytime, col = high_use))
ba1 + geom_bar(position = "dodge") + ggtitle("Study time")

bo1 <- ggplot(student_alc, aes(x = high_use, y = studytime, col = high_use))
bo1 + geom_boxplot() + ggtitle("Study time")

The bar plot shows that although roughly the same number of students without high use have study time 1 as have study time 3, among those with high use the number of students with study time 1 is much greater than the number with study time 3. Also, among students without high use there are over twice as many with study time 2 as with study time 1, whereas for students with high use the difference is much smaller. Thus my hypothesis that students with high alcohol consumption spend less time studying seems to be supported. The box plot also supports this view.

Age

ba2 <- ggplot(student_alc, aes(x = age, col = high_use))
ba2 + geom_bar(position = "dodge") + ggtitle("Age")

bo2 <- ggplot(student_alc, aes(x = high_use, y = age, col = high_use))
bo2 + geom_boxplot() + ggtitle("Age")

The bar and box plots suggest that there might be some differences in students’ alcohol consumption depending on their age. A relatively small percentage of 15-year-old students have high use of alcohol, but as the students get older, the percentage of students of the same age with high use seems to increase. Thus my hypothesis that students of different ages have similar alcohol consumption rates might be incorrect.

Failures

ba3 <- ggplot(student_alc, aes(x = failures, col = high_use))
ba3 + geom_bar(position = "dodge") + ggtitle("Failures")

bo3 <- ggplot(student_alc, aes(x = high_use, y = failures, col = high_use))
bo3 + geom_boxplot() + ggtitle("Failures")

From the bar plot I see that there are about the same number of students with and without high use who have 1, 2 or 3 failures. This is interesting because there are far more students without high use than with high use, which suggests that the percentage of students with failures is higher among those with high use. This supports my hypothesis that students with high alcohol consumption have more failures. However, it is also clear from both plots that in both groups (with and without high use) the number of students with failures is quite small compared to the number without failures.

Romantic

ba4 <- ggplot(student_alc, aes(x = romantic, col = high_use))
ba4 + geom_bar(position = "dodge") + ggtitle("Romantic")

From the bar plot I see that most students don’t have a romantic relationship. Those without high use seem slightly more likely to have a relationship than those with high use, but the difference is quite small. Thus my hypothesis that students with a romantic relationship have high alcohol consumption while single students have lower consumption is not supported by this plot.

Next I look at some cross-tabulations of the variables. I use the knitr library to present the results as nicer-looking tables (the knitr::kable part of my code does this).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
crosstab1 <- student_alc %>% group_by(high_use, romantic) %>% summarise(count = n())
crosstab2 <- student_alc %>% group_by(high_use) %>% summarise(mean_studytime = mean(studytime), mean_age = mean(age),mean_failures = mean(failures))
knitr::kable(crosstab1)
| high_use | romantic | count |
|----------|----------|-------|
| FALSE    | no       | 180   |
| FALSE    | yes      | 88    |
| TRUE     | no       | 81    |
| TRUE     | yes      | 33    |
knitr::kable(crosstab2)
| high_use | mean_studytime | mean_age | mean_failures |
|----------|----------------|----------|---------------|
| FALSE    | 2.149254       | 16.50000 | 0.1417910     |
| TRUE     | 1.771930       | 16.78947 | 0.3421053     |

From the first cross-tabulation, with the high_use and romantic variables, I see that 88 out of the 268 students without high use of alcohol have a relationship (i.e. 33%), while 33 out of the 114 with high use have a romantic relationship (i.e. 29%). This again suggests that my hypothesis about romantic relationships and alcohol consumption is likely to be wrong.

From the second cross-tabulation I see that the average study time for those without high use is indeed greater than for those with high use (although the difference is not very large), supporting my earlier discussion about study time. The average age of students with and without high use seems quite similar, which might indicate that my hypothesis about age not being related to alcohol consumption might be actually true. For failures, the difference between the means is small, but it might still be possible my hypothesis is correct.

So far I have mixed results with my hypotheses and will conduct more analysis.

Logistic regression

First I fit a logistic regression model with high use as the target variable and study time, age, failures and romantic as the predictors. I create a summary of the fitted model:

mymodel <- glm(high_use ~ studytime + age + failures + romantic, family = "binomial", data = student_alc)
summary(mymodel)
## 
## Call:
## glm(formula = high_use ~ studytime + age + failures + romantic, 
##     family = "binomial", data = student_alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4680  -0.8575  -0.7179   1.2735   2.2796  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.1149     1.6999  -1.832 0.066888 .  
## studytime    -0.5483     0.1583  -3.464 0.000531 ***
## age           0.2000     0.1027   1.948 0.051458 .  
## failures      0.3080     0.1940   1.588 0.112373    
## romanticyes  -0.2132     0.2546  -0.838 0.402263    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 439.96  on 377  degrees of freedom
## AIC: 449.96
## 
## Number of Fisher Scoring iterations: 4

From the summary I see that study time is negatively related to high use (coefficient -0.55), and both age and failures are positively related (coefficients 0.20 and 0.31 respectively). Romantic relationships seem to be negatively related to high use (-0.21).
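In equation form, the fitted logistic model predicts the log-odds of high use:

$$\log\frac{p}{1-p} = -3.11 - 0.548 \cdot \text{studytime} + 0.200 \cdot \text{age} + 0.308 \cdot \text{failures} - 0.213 \cdot \text{romanticyes}$$

where p is the probability of high use and romanticyes is 1 for students in a relationship.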

Looking at the p-values, study time is statistically significant (with a p-value far below 0.05), suggesting that its coefficient is not 0 (it is related to high use). For age, the p-value is 0.051, so it is difficult to say whether it is statistically significant or not. Failures has a p-value of 0.11, and thus could probably be removed from the model (the null hypothesis that the coefficient of failures is 0 cannot be rejected). For romantic relationships, the p-value is 0.40, suggesting that it isn’t statistically significant and can be removed from the model. I will not remove any variables yet.

I will also look at the coefficients of my model (as odds ratios) and their confidence intervals.

odds_r <- coef(mymodel) %>% exp
conf_int <- confint(mymodel) %>% exp
## Waiting for profiling to be done...
cbind(odds_r, conf_int)
##                 odds_r       2.5 %    97.5 %
## (Intercept) 0.04438302 0.001526459 1.2137417
## studytime   0.57795482 0.419696413 0.7815616
## age         1.22142684 1.000405322 1.4977093
## failures    1.36071556 0.931252861 2.0033979
## romanticyes 0.80795889 0.486401760 1.3227895

From the odds ratios and their 95% confidence intervals I can see that two confidence intervals contain 1: those of failures and romantic. An odds ratio of 1 would mean that the odds of high use with the characteristic (e.g. a romantic relationship) are the same as the odds without it. This suggests that these two variables are not related to high use (their presence doesn’t change the odds of high use), which again supports my earlier suggestion that they can be removed from the model.

Study time, on the other hand, doesn’t contain the value 1 in its confidence interval. Thus it seems that study time does have a relationship with high use and should be kept in the model. The odds ratio for study time is less than 1, suggesting that study time is negatively associated with high use, as also found earlier.

Age, again, is more complex to interpret. The lower bound of its confidence interval comes very close to 1 (1.0004), so it is again difficult to say whether or not it clearly has a relationship with high use. It seems that age is positively related to high use, as mentioned before, but the effect is fairly weak (the odds ratio of 1.22 is close to 1). I will keep age in my model for now, as the evidence against keeping it is not very strong.
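The odds ratios are simply the exponentiated coefficients from the model summary; for example, for study time:

$$\text{OR}_{\text{studytime}} = e^{-0.5483} \approx 0.578$$

so a confidence interval for the odds ratio containing 1 corresponds exactly to a confidence interval for the coefficient containing 0.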

When comparing these results to my hypotheses, it seems that only my hypothesis about study time is supported by the model (study time is negatively related to high use). My hypothesis about age (no relationship) might or might not be correct - it seems there is a weak relationship. My hypotheses about failures and romantic seem both incorrect as my model suggests no relationship between either of these variables and high use.

The predictive power of my model

So, now I change my model to only include study time and age as the predictors.

mymodel2 <- glm(high_use ~ studytime + age, family = "binomial", data = student_alc)

I will look at my model’s predictions and the actual values of my data next.

probabilities <- predict(mymodel2, type = "response")
student_alc <- mutate(student_alc, probability = probabilities)
student_alc <- mutate(student_alc, prediction = probabilities > 0.5)
table(high_use = student_alc$high_use, prediction = student_alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   259    9
##    TRUE    102   12

Here is a graphic visualising the actual values and the predictions:

g <- ggplot(student_alc, aes(x = probability, y = high_use, col=prediction))
g + geom_point()

The total proportion of inaccurately classified individuals is calculated with a loss function:

loss_funct <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_funct(class = student_alc$high_use, prob = student_alc$probability)
## [1] 0.2905759

It seems that the proportion of inaccurately classified individuals is 0.29 (29%). The smaller this number is, the better the model is at prediction, so my model seems OK but there is room for improvement, as nearly a third of individuals are misclassified. My model mainly misclassifies those with high use as not having high use: out of the 114 students with high use, my model classifies only 12 as having high use. However, out of the 268 who don’t have high use, my model correctly classifies 259 individuals.
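The misclassification rate can also be recomputed by hand from the confusion table above; here is a small sketch using the counts copied from that table:

```r
# Confusion matrix copied from the table above (rows = actual, cols = predicted)
conf <- matrix(c(259, 102, 9, 12), nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
# The misclassified students are the off-diagonal cells
misclassified <- conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]  # 9 + 102 = 111
misclassified / sum(conf)  # 111 / 382, about 0.29, matching the loss function
```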

The performance of my model should still be better than that of a simple guessing strategy. For example, let’s say I guess that all individuals who spend little time studying (study time 1 or 2) have high use of alcohol, and all those who spend a lot of time studying (study time 3 or 4) don’t. I use the xtabs function to get a table showing the number of students with and without high use for the different study times. I again use the knitr library to present the table in a nicer format.

guess <- xtabs(~ high_use + studytime, student_alc)
knitr::kable(guess)
| high_use | 1  | 2   | 3  | 4  |
|----------|----|-----|----|----|
| FALSE    | 58 | 135 | 52 | 23 |
| TRUE     | 42 | 60  | 8  | 4  |

I see that there are 58 + 135 = 193 students whose study time is 1 or 2 but who don’t have high use of alcohol. There are also 8 + 4 = 12 students with study time 3 or 4 who do have high use. Thus my guess would have been incorrect in 193 + 12 = 205 out of 382 cases, i.e. in 54% of cases. My model is therefore far more accurate than this simple guessing strategy.
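The error rate of the guessing strategy can be checked with a few lines of base R (counts copied from the cross table above):

```r
# Counts from the cross table above: columns are study times 1-4
not_high <- c(58, 135, 52, 23)  # students without high use
high     <- c(42, 60, 8, 4)     # students with high use
# Guess: study time 1-2 -> high use, study time 3-4 -> no high use
wrong <- sum(not_high[1:2]) + sum(high[3:4])  # 193 + 12 = 205
wrong / (sum(not_high) + sum(high))           # 205 / 382, about 0.54
```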


Exercises for week 4

The Boston dataset

I load the Boston data and explore its structure and dimensions.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset has 506 observations and 14 variables. The dataset relates to housing values in Boston’s suburbs. The variables include, for example, crim (per capita crime rate by town), dis (weighted mean of distances to five Boston employment centres) and ptratio (pupil-teacher ratio by town).

Overview of the data

Here is a graphical overview of the data.

library(GGally)
library(ggplot2)
GraphicalOverview <- ggpairs(Boston, mapping = aes(), title = "Graphical Overview", lower = list(combo = wrap("facethist", bins = 20)))
GraphicalOverview

Here is also a correlation plot of the variables:

library(corrplot)
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "r", tl.pos = "d")

Here are summaries of the variables. The package knitr is used to present the results in a nicer table format (the knitr::kable part of the code).

library(knitr)
knitr::kable(summary(Boston))
| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|-------|-------|------|
| Min. : 0.00632 | Min. : 0.00 | Min. : 0.46 | Min. :0.00000 | Min. :0.3850 | Min. :3.561 | Min. : 2.90 | Min. : 1.130 | Min. : 1.000 | Min. :187.0 | Min. :12.60 | Min. : 0.32 | Min. : 1.73 | Min. : 5.00 |
| 1st Qu.: 0.08204 | 1st Qu.: 0.00 | 1st Qu.: 5.19 | 1st Qu.:0.00000 | 1st Qu.:0.4490 | 1st Qu.:5.886 | 1st Qu.: 45.02 | 1st Qu.: 2.100 | 1st Qu.: 4.000 | 1st Qu.:279.0 | 1st Qu.:17.40 | 1st Qu.:375.38 | 1st Qu.: 6.95 | 1st Qu.:17.02 |
| Median : 0.25651 | Median : 0.00 | Median : 9.69 | Median :0.00000 | Median :0.5380 | Median :6.208 | Median : 77.50 | Median : 3.207 | Median : 5.000 | Median :330.0 | Median :19.05 | Median :391.44 | Median :11.36 | Median :21.20 |
| Mean : 3.61352 | Mean : 11.36 | Mean :11.14 | Mean :0.06917 | Mean :0.5547 | Mean :6.285 | Mean : 68.57 | Mean : 3.795 | Mean : 9.549 | Mean :408.2 | Mean :18.46 | Mean :356.67 | Mean :12.65 | Mean :22.53 |
| 3rd Qu.: 3.67708 | 3rd Qu.: 12.50 | 3rd Qu.:18.10 | 3rd Qu.:0.00000 | 3rd Qu.:0.6240 | 3rd Qu.:6.623 | 3rd Qu.: 94.08 | 3rd Qu.: 5.188 | 3rd Qu.:24.000 | 3rd Qu.:666.0 | 3rd Qu.:20.20 | 3rd Qu.:396.23 | 3rd Qu.:16.95 | 3rd Qu.:25.00 |
| Max. :88.97620 | Max. :100.00 | Max. :27.74 | Max. :1.00000 | Max. :0.8710 | Max. :8.780 | Max. :100.00 | Max. :12.127 | Max. :24.000 | Max. :711.0 | Max. :22.00 | Max. :396.90 | Max. :37.97 | Max. :50.00 |

The variables are quite interesting. For example, the per capita crime rate by town (crim) seems to be quite low in most cases (the 3rd quartile is 3.68), but the maximum crime rate is 88.98, suggesting that some areas differ greatly from the more typical areas when it comes to crime. For the pupil-teacher ratio by town (ptratio), the 1st quartile is 17.40 and the 3rd quartile 20.20, meaning that in 50% of the areas there are between 17.4 and 20.2 pupils per teacher. However, even for this variable the difference between the minimum (12.60) and the maximum (22.00) is significant, with one teacher in some areas having on average almost 10 pupils more than in others. The proportion of owner-occupied units built prior to 1940 (age) also has a large range (min 2.90, max 100.00), and there is a ten-fold difference between the minimum and maximum median home values (medv). The average number of rooms per dwelling (rm) seems to be the only variable that is somewhat normally distributed.

The scatter plots in the graphical overview can give some indication about the relationships between the variables. For example, when looking at home values (medv), the highest home values seem to be in the areas with the lowest lstat (lower status of the population %) values, whereas the high lstat areas have only low home values. The correlation plot also confirms that there is strong (negative) correlation between home values and lstat. The correlation plot also suggests that there is a correlation between the number of rooms per dwelling (rm) and home values. The same is indicated by the scatter plots - the more rooms the dwelling has, the higher the home value, and vice versa. On the other hand, there is no clear relationship between for example proportion of residential land zoned for lots over 25,000 sq.ft (zn) and home value, as high home values seem to correspond to both high and low values of zn. The correlation plot also shows no strong correlation between these two variables.

Standardising the dataset

Next I will scale the dataset and look at a summary of the scaled data (again using knitr to present the summary as a nice table). I also convert the scaled dataset (sBoston) to a data frame format, which will be needed later on.

sBoston <- scale(Boston)
knitr::kable(summary(sBoston))
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811 Median :-0.1449
Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
sBoston <- as.data.frame(sBoston)

The values of the variables changed so that the mean of every variable is now 0, and the other values have been scaled with the formula (x - mean(x)) / sd(x). This means that the min and 1st quartile values are below 0 and the 3rd quartile and max above 0. The scales of the variables are also much closer to each other now.
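The scaling formula can be verified by hand for a single variable. This is a quick sketch assuming the Boston data from the MASS package:

```r
library(MASS)
data("Boston")
# Standardise 'crim' manually with (x - mean(x)) / sd(x)
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
# scale() applies the same formula column-wise, so the results should match
scaled_crim_col <- scale(Boston)[, "crim"]
all.equal(unname(manual_crim), unname(scaled_crim_col))  # TRUE
```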

Next I create a categorical variable ‘crime’ from the (scaled) crime rate. The ‘crime’ variable will have 4 levels based on the quantiles: low, med_low, med_high and high. I also drop the old crime rate variable from my scaled Boston dataset (sBoston) and add the new categorical crime rate to it.

scaled_crim <- sBoston$crim
bins <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
sBoston <- dplyr::select(sBoston, -crim)
sBoston <- data.frame(sBoston, crime)
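A quick sanity check is to tabulate the new factor: since the breaks are the quantiles, the four classes should be roughly equal in size (about a quarter of the 506 observations each). This sketch assumes the crime factor created above:

```r
# Quartile-based cuts should give four classes of roughly equal size
table(crime)
prop.table(table(crime))
```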

Next I will divide the scaled dataset into train and test sets, with 80% of the data being in the train set.

n <- nrow(sBoston)
ind <- sample(n, size = n*0.8)
train <- sBoston[ind,]
test <- sBoston[-ind,]
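Note that sample() draws a different random split every time the document is knitted. If a reproducible split is wanted, a fixed seed can be set first — a sketch assuming the sBoston object above (the seed value 2017 is arbitrary):

```r
set.seed(2017)  # any fixed value; chosen arbitrarily here
n <- nrow(sBoston)
ind <- sample(n, size = n * 0.8)  # indices of the 80% train set
train <- sBoston[ind, ]
test <- sBoston[-ind, ]
```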

Linear discriminant analysis

Next I will fit a linear discriminant analysis on the train set, using the categorical crime rate as the target variable and all the other variables as predictors. I will also draw the LDA (bi)plot, but without showing the LDA arrows in the biplot.

lda.fit <- lda(crime ~ ., data = train)
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)

Predicting the classes

I will next save the correct classes (categories) for crime rate in the test set and then remove the categorical crime rate variable from the test set for testing purposes. After this I will predict the crime classes with my LDA model on the test data and look at whether the predictions were correct.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13       8        1    0
##   med_low    6      13       10    0
##   med_high   0       6       18    0
##   high       0       0        0   27

My LDA predictions were correct about 70-75% of the time (depending on which observations ended up in the train and test sets, as these change every time I knit the document). For the table above, the accuracy is (13 + 13 + 18 + 27) / 102 ≈ 0.70. The predictions for the class high seem to be almost entirely correct, but the model has more trouble predicting the other classes.
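Instead of eyeballing the table, the accuracy can be computed directly from it, since the correct predictions lie on the diagonal. A sketch assuming the correct_classes and lda.pred objects above:

```r
conf <- table(correct = correct_classes, predicted = lda.pred$class)
# Proportion of correct predictions = sum of the diagonal / all predictions
sum(diag(conf)) / sum(conf)
```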

The distances between the observations

Next I will reload the Boston dataset and scale it again (calling it sBoston2 this time). I will also calculate the distances between the observations using the Euclidean distance measure.

data("Boston")
sBoston2 <- scale(Boston)
sBoston2 <- as.data.frame(sBoston2)
dist_eu <- dist(sBoston2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
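Euclidean distance is only one of the metrics dist() supports. For comparison, the Manhattan distance could be computed simply by changing the method argument (a sketch assuming sBoston2 from above):

```r
# Manhattan (city-block) distance between the same scaled observations
dist_man <- dist(sBoston2, method = "manhattan")
summary(dist_man)
```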

Next I will use the k-means clustering method on the dataset. I don’t yet know the optimal number of clusters, so I use 5 for this first run of the algorithm.

km <- kmeans(dist_eu, centers = 5)
pairs(sBoston2, col = km$cluster, lower.panel = NULL)

However, I want to determine the optimal number of clusters.

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type = 'b')

From this I see that the optimal number of clusters seems to be 2, as this is where the largest drop in the WCSS occurs. Now I run the k-means clustering again with two clusters and visualise the results with a plot.

km <- kmeans(dist_eu, centers = 2)
pairs(sBoston2, col = km$cluster, lower.panel = NULL)

In this visualisation all the data points are assigned to two different clusters (the red and black clusters). It seems that some variables are more relevant for clustering than others. For example points with low rad (index of accessibility to radial highways) values in general belong to the red cluster, and points with high rad values belong to the black cluster. Thus rad seems useful for clustering purposes. Another relevant variable is e.g. tax (full-value property-tax rate). However, for example the values of the variable chas don’t seem to give any indication about which cluster the points belong to, and thus chas is not very relevant for clustering. Thus this visualisation is helpful in showing whether the variables or variable pairs are good indicators about which cluster a specific data point belongs to.
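The visual impression can be backed up numerically by looking at the cluster sizes and the per-cluster means of the (scaled) variables. A sketch assuming the km object fitted above:

```r
# How many observations fall in each cluster?
table(km$cluster)
# Mean of each scaled variable within the clusters; rad and tax should
# differ clearly between the two clusters if they drive the split
aggregate(sBoston2, by = list(cluster = km$cluster), FUN = mean)
```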


Exercises for week 5

The Human dataset

I will start off by exploring the structure and dimensions of my ‘human’ dataset.

human <- read.table("/Users/Noora/Documents/IODS-project/data/create_human_.txt", header = TRUE)
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ High_educ_f_m: num  1.007 0.997 0.983 0.989 0.969 ...
##  $ F_m_working  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Educ_exp     : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life_exp     : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI_         : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat_deaths   : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Young_births : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parl_seats   : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

From the outputs I see that the dataset has 155 observations and 8 variables. However, I know that the row names denote the variable ‘Country’, so in reality the dataset has 9 variables, not 8. For example, the last row of my dataset is:

library(knitr)
knitr::kable(tail(human, n = 1))
High_educ_f_m F_m_working Educ_exp Life_exp GNI_ Mat_deaths Young_births Parl_seats
Niger 0.3076923 0.4459309 5.4 61.4 908 630 204.8 13.3

Here I have 8 named columns, but the first column shows the name of the country in question (in this case Niger), so in effect there are 9 variables.
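This can be verified directly: the country names live in the row names of the data frame. A sketch assuming the human object above (human_with_country is just a hypothetical name for the result):

```r
# The row names hold the 'Country' variable
head(rownames(human))
# If a proper column is preferred, it can be added explicitly
human_with_country <- data.frame(Country = rownames(human), human)
```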

The original ‘human’ dataset includes variables that relate to the Human Development Index (HDI) and to other indices such as the Gender Development Index (GDI), and combines indicators from most countries. The dataset originally consisted of more variables than the 9 shown here, but it has been wrangled down to 9 variables, restricted to only those observations that relate to countries, and includes only observations without missing values. The variable ‘Country’ is the name of the country in question. The other 8 variables in my dataset relate to two topics: Health and knowledge, and Empowerment. The variables related to Empowerment in my dataset are:

  • High_educ_f_m: Proportion of females with at least secondary education / Proportion of males with at least secondary education
  • F_m_working: Proportion of females in the labour force / Proportion of males in the labour force
  • Parl_seats: Percentage of female representatives in parliament

The variables related to Health and knowledge in my dataset are:

  • Educ_exp: Expected years of schooling
  • Life_exp: Life expectancy at birth
  • GNI_: Gross National Income per capita
  • Mat_deaths: Maternal mortality ratio
  • Young_births: Adolescent birth rate

The source for the definitions of the variables is this website.

Overview of the data

Here is a graphical overview of the data:

library(ggplot2)
library(GGally)
GraphOV <- ggpairs(human, mapping = aes(), title = "Graphical Overview", lower = list(combo = wrap("facethist", bins = 20)))
GraphOV

Here are also summaries of the variables in my dataset (the knitr part of the code just formats the result into a nice table):

knitr::kable(summary(human))
High_educ_f_m F_m_working Educ_exp Life_exp GNI_ Mat_deaths Young_births Parl_seats
Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00 Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
Median :0.9375 Median :0.7535 Median :13.50 Median :74.20 Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65 Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50 Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50

From the summary table I see immediately that many of the variables have fairly large ranges. For example, some countries have no women at all in parliament (as the min value of Parl_seats is 0), whereas some have more women in parliament than men (as the max value is above 50%). This already suggests to me that some countries in the dataset are quite different from each other. Similarly there seem to be large differences between countries with regard to all the other variables. For example, in some countries the expected years of schooling are under 6 and in others above 20. At the same time, expected years of schooling seems to be among those variables where many countries are also fairly similar, with 50% of countries expecting inhabitants to have between 11 and 15 years of schooling. Judging by the graphical overview, this variable is also the one that comes closest to being normally distributed.

The graphical overview indicates that some of the variables may be related to one another. For example, it seems that as the adolescent birth rate increases, so does the maternal mortality ratio. Not surprisingly, there also seems to be a relationship between maternal mortality ratio and life expectancy, with higher maternal mortality ratios corresponding to lower life expectancies. In fact, a great number of the variables seem to be correlated. Another example is education and life expectancy: the higher the expected years of schooling, the higher the life expectancy, and vice versa. Among the variable pairs that don’t seem to be strongly correlated are e.g. expected years of schooling and the percentage of female representatives in parliament.
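These visual impressions can be quantified with a correlation matrix — a sketch assuming the human data frame above, with values rounded for readability:

```r
# Pearson correlations between all the variables, rounded to two decimals
knitr::kable(round(cor(human), 2))
```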

Principal component analysis (PCA)

Next I will perform principal component analysis (PCA) without standardising the dataset. First I will look at the variability captured by the principal components. I will also draw a biplot displaying the observations with PC1 in coordinate x-axis and PC2 in y-axis.

pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
biplot(pca_human, choices = 1:2, cex = c(1, 1), col = c("grey40", "deeppink2"), sub = "Plot 1: PC1 & PC2 with non standardised dataset")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

However, I should first standardise the variables to be able to conduct a meaningful analysis, so I repeat the PCA and look at the variability and the biplot of the standardised dataset next.

human_std <- scale(human)
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
biplot(pca_human_std, choices = 1:2, cex = c(1, 1), col = c("grey40", "deeppink2"), sub = "Plot 2: PC1 & PC2 with standardised dataset")

When comparing the two plots, the results are clearly different. Without standardisation, PC1 captures practically all (99.99%) of the variance, because GNI per capita is measured on a far larger scale than the other variables and therefore dominates the analysis — PCA is sensitive to the scales of the features. After standardising, the variance is spread over several components: PC1 captures about 53.6% and PC2 about 16.2%, so the first two components together explain roughly 70% of the variation. The standardised results are the meaningful ones, since each variable now contributes on an equal footing.
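The contrast between the two runs can be made explicit by extracting the proportion of variance captured by each component from both PCA objects (a sketch assuming pca_human and pca_human_std from above):

```r
# Proportion of variance (%) per component, without and with standardisation
round(100 * summary(pca_human)$importance[2, ], 1)
round(100 * summary(pca_human_std)$importance[2, ], 1)
```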

Interpreting PC1 and PC2

Based on the biplot drawn after PCA on the standardised human data, my interpretation is that PC1 mainly reflects overall development in health and knowledge: variables such as life expectancy, expected years of schooling, GNI and the female/male education ratio point in one direction along it, while maternal mortality and adolescent birth rate point in the opposite direction. PC2, in turn, seems to relate to empowerment, being driven by female labour force participation and the percentage of female representatives in parliament. In other words, the first dimension separates countries by general wellbeing and education, and the second by how actively women participate in working life and politics.

The tea dataset

Next I move to another dataset - the tea dataset from the FactoMineR package.

library(FactoMineR)
data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

The dataset seems to have 36 variables and 300 observations. The variables are e.g. age, sex, lunch and dinner, and seem to relate to tea consumption. I want to keep only some interesting variables in my analysis, so I choose to focus on the following:

  • frequency: ‘1/day’, ‘1 to 2/week’, ‘+2/day’ and ‘3 to 6/week’
  • Tea: ‘Earl Grey’, ‘green’ and ‘black’
  • sugar: ‘sugar’ and ‘no sugar’
  • tea.time: ‘not tea time’ and ‘tea time’
  • sex: ‘female’ (F) and ‘male’ (M)

I will next visualise these variables from the dataset.

library(tidyr)
library(dplyr)
keep_columns <- c("frequency", "Tea", "sugar", "tea.time", "sex")
te1 <- dplyr::select(tea, one_of(keep_columns))
gather(te1) %>% ggplot(aes(value)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) + facet_wrap("key", scales = "free")
## Warning: attributes are not identical across measure variables; they will
## be dropped

Multiple correspondence analysis and the tea data

Next I will perform MCA on the variables of the tea dataset that I kept for analysis.

mca_tea <- MCA(te1, graph = FALSE)
summary(mca_tea)
## 
## Call:
## MCA(X = te1, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.316   0.237   0.217   0.201   0.189   0.177
## % of var.             19.737  14.823  13.555  12.562  11.836  11.040
## Cumulative % of var.  19.737  34.559  48.114  60.676  72.512  83.552
##                        Dim.7   Dim.8
## Variance               0.146   0.117
## % of var.              9.112   7.336
## Cumulative % of var.  92.664 100.000
## 
## Individuals (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1            |  0.739  0.576  0.302 |  0.514  0.372  0.146 |  0.162  0.040
## 2            | -0.171  0.031  0.018 |  0.628  0.555  0.243 | -0.186  0.053
## 3            | -0.785  0.650  0.714 | -0.250  0.088  0.073 | -0.129  0.026
## 4            |  0.929  0.911  0.661 | -0.179  0.045  0.024 | -0.287  0.127
## 5            |  0.152  0.024  0.021 |  0.277  0.108  0.068 | -0.039  0.002
## 6            |  0.507  0.272  0.201 |  0.227  0.073  0.040 | -0.443  0.302
## 7            | -0.028  0.001  0.000 |  0.041  0.002  0.001 |  0.358  0.197
## 8            | -0.486  0.249  0.105 |  0.315  0.139  0.044 |  0.799  0.982
## 9            | -0.298  0.094  0.087 |  0.042  0.002  0.002 |  0.062  0.006
## 10           | -0.038  0.002  0.001 |  0.970  1.323  0.581 |  0.410  0.258
##                cos2  
## 1             0.014 |
## 2             0.021 |
## 3             0.019 |
## 4             0.063 |
## 5             0.001 |
## 6             0.153 |
## 7             0.055 |
## 8             0.283 |
## 9             0.004 |
## 10            0.104 |
## 
## Categories (the 10 first)
##                  Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## 1/day        |   0.394   3.108   0.072   4.634 |  -0.037   0.036   0.001
## 1 to 2/week  |   0.772   5.543   0.103   5.538 |  -0.227   0.636   0.009
## +2/day       |  -0.604   9.772   0.268  -8.944 |   0.084   0.254   0.005
## 3 to 6/week  |   0.155   0.173   0.003   0.960 |   0.082   0.064   0.001
## black        |  -0.456   3.243   0.068  -4.508 |   1.055  23.164   0.365
## Earl Grey    |   0.080   0.261   0.012   1.860 |  -0.632  21.698   0.721
## green        |   0.553   2.132   0.038   3.363 |   1.332  16.467   0.219
## No.sugar     |  -0.573  10.743   0.351 -10.243 |   0.478   9.946   0.244
## sugar        |   0.612  11.483   0.351  10.243 |  -0.511  10.632   0.244
## Not.tea time |   0.712  14.036   0.393  10.846 |   0.323   3.836   0.081
##               v.test     Dim.3     ctr    cos2  v.test  
## 1/day         -0.435 |  -0.885  22.865   0.363 -10.416 |
## 1 to 2/week   -1.625 |   1.173  18.618   0.237   8.411 |
## +2/day         1.249 |   0.056   0.122   0.002   0.829 |
## 3 to 6/week    0.505 |   0.745   5.803   0.071   4.607 |
## black         10.441 |   0.942  20.177   0.290   9.319 |
## Earl Grey    -14.687 |  -0.103   0.632   0.019  -2.397 |
## green          8.099 |  -1.508  23.079   0.281  -9.169 |
## No.sugar       8.542 |  -0.176   1.468   0.033  -3.138 |
## sugar         -8.542 |   0.188   1.569   0.033   3.138 |
## Not.tea time   4.913 |  -0.132   0.706   0.014  -2.015 |
## 
## Categorical variables (eta2)
##                Dim.1 Dim.2 Dim.3  
## frequency    | 0.294 0.012 0.514 |
## Tea          | 0.089 0.727 0.476 |
## sugar        | 0.351 0.244 0.033 |
## tea.time     | 0.393 0.081 0.014 |
## sex          | 0.452 0.122 0.048 |

Finally I interpret the results and draw the variable biplot of the analysis. From the MCA summary above I see that the first two dimensions capture only about 34.6% of the variance, so the two-dimensional plot should be read with some caution. The eta2 values show that Dim.1 is most strongly tied to sex, tea.time and sugar, whereas Dim.2 is dominated by the type of tea (Tea). In the variable biplot, categories plotted close to each other are more similar, i.e. they tend to be chosen by the same respondents.

plot(mca_tea, invisible = c("ind"), habillage = "quali")
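The plot() method for MCA objects offers other options too; for example, the individuals can be drawn instead of the variable categories by hiding the variables (a sketch assuming the mca_tea object above):

```r
# Draw only the individuals (hide the variable categories)
plot(mca_tea, invisible = c("var"), habillage = "quali")
```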